Tectogrammatical representation: towards a minimal transfer in machine translation
Abstract
The Prague Dependency Treebank (PDT, as described, e.g., in (Hajič, 1998) or more recently in (Hajič, Pajas and Vidová Hladká, 2001)) is a project of linguistic annotation of a corpus of approximately 1.5 million words of naturally occurring written Czech on three levels (“layers”) of complexity and depth: morphological, analytical, and tectogrammatical. The aim of the project is to build a reference corpus whose annotation draws as much as possible on the accumulated findings of the Prague School, while simultaneously showing (through experiments, mainly of a statistical nature) that such a framework is not only theoretically interesting but possibly also of practical use. In this contribution we want to show that the deepest (tectogrammatical) layer of representation of sentence structure we use, which represents “linguistic meaning” as described in (Sgall, Hajičová and Panevová, 1986) and which also records certain aspects of discourse structure, has properties that can be used effectively at the transfer stage of machine translation between languages of quite different nature. We believe that such a representation not only minimizes the “distance” between languages at this layer, but also delegates individual language phenomena to where they belong, whether that is analysis, transfer or generation, regardless of the methods used to perform these steps.
Similar resources
Czech-English Dependency-based Machine Translation
We present some preliminary results of a Czech-English translation system based on dependency trees. The fully automated process includes: morphological tagging, analytical and tectogrammatical parsing of Czech, tectogrammatical transfer based on lexical substitution using word-to-word translation dictionaries enhanced by the information from the English-Czech parallel corpus of WSJ, and a simp...
Czech-English Dependency Tree-based Machine Translation
We present some preliminary results of a Czech-English translation system based on dependency trees. The fully automated process includes: morphological tagging, analytical and tectogrammatical parsing of Czech, tectogrammatical transfer based on lexical substitution using word-to-word translation dictionaries enhanced by the information from the English-Czech parallel corpus of WSJ, and a simp...
Towards English-to-Czech MT via Tectogrammatical Layer
We present an overview of an English-to-Czech machine translation system. The system relies on transfer at the tectogrammatical (deep syntactic) layer of the language description. We report on the progress of linguistic annotation of the English tectogrammatical layer and also on the first end-to-end evaluation of our syntax-based MT system.
Automatic Alignment of Czech and English Deep Syntactic Dependency Trees
In this paper, we focus on alignment of Czech and English tectogrammatical dependency trees. The alignment of deep syntactic dependency trees can be used for training transfer models for machine translation systems based on analysis-transfer-synthesis architecture. The results of our experiments show that shifting the alignment task from the word layer to the tectogrammatical layer both (a) inc...
Treebanks in Machine Translation
We present an approach using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: Prague Dependency Treebank, a newly created Czech-English parallel corpus of Penn Treebank, English monolingual corpus,...